Google Cloud Datalab provides a first-class experience for working with Google BigQuery, for both ad-hoc exploratory work and pipeline development. In addition to authoring and executing SQL queries, you'll likely want to inspect Datasets, Tables, and their Schemas, and also sample data.
This notebook introduces various BigQuery commands that Datalab brings into the notebook environment.
You've already seen a BigQuery command in the Hello BigQuery notebook; these commands are invoked using %%bq.
In [5]:
%bq --help
You can get help on any of the individual commands as follows:
In [2]:
%bq tables --help
The first thing you might want to do is list the datasets and tables contained in a specified project, or in the default project in which this Datalab environment is running.
Note that you will be able to list and use any BigQuery dataset that has been shared with the project (specifically, its service account) in which you are running Datalab.
In [1]:
%bq datasets list --project cloud-datalab-samples
Out[1]:
In [2]:
%bq tables list --dataset cloud-datalab-samples.httplogs
Out[2]:
In [3]:
%bq tables describe --name cloud-datalab-samples.httplogs.logs_20140615
Out[3]:
In [6]:
%bq sample --help
As you can see, Datalab allows you to sample a specified number of rows using various sampling strategies, including random sampling and hashing. Let's try retrieving a couple of samples.
In [7]:
%bq sample --table cloud-datalab-samples.httplogs.logs_20140615 --count 5
Out[7]:
You can also specify a list of fields to project.
In [8]:
%bq sample --table cloud-datalab-samples.httplogs.logs_20140615 --count 10 --fields timestamp,latency --method hashed --key-field latency
Out[8]:
Of course, since querying BigQuery data using SQL is the primary scenario, it's always handy to have the BigQuery SQL reference. Datalab makes this easy to access by providing a direct link to the query reference via the Help Links button on the top toolbar.
Let's look at some more query examples.
In [9]:
%%bq query
SELECT timestamp, latency
FROM `cloud-datalab-samples.httplogs.logs_20140615`
ORDER BY latency DESC
LIMIT 10
Out[9]:
That was a simple query. Now, let's do something a bit more interesting: let's compute the 50th percentile, 95th percentile, and 99th percentile latencies for request processing.
BigQuery makes this effortless with its built-in statistical and analytic functions, such as APPROX_QUANTILES.
In [10]:
%%bq query
SELECT
  quantiles[SAFE_ORDINAL(50)] AS latency_50th_percentile,
  quantiles[SAFE_ORDINAL(95)] AS latency_95th_percentile,
  quantiles[SAFE_ORDINAL(99)] AS latency_99th_percentile
FROM (
  SELECT APPROX_QUANTILES(latency, 100) AS quantiles
  FROM `cloud-datalab-samples.httplogs.logs_20140615`
)
Out[10]:
Beyond tables, it is almost always interesting to visualize the data to get a more meaningful view of aggregates, trends, and patterns.
Let's write another query, again using APPROX_QUANTILES. This time, the ROW_NUMBER function will be used to include a row number in the output data representing the percentile; this number serves as the chart's x-axis value. In order to reference this SQL query later in the notebook, it is given a name via the --name/-n argument. The resulting query object can then be passed to a chart by referencing its name, and the chart will execute the contained query.
In [11]:
%%bq query --name data
WITH quantiles AS (
  SELECT APPROX_QUANTILES(LOG10(latency), 50) AS timearray
  FROM `cloud-datalab-samples.httplogs.logs_20140615`
  WHERE latency <> 0
)
SELECT ROW_NUMBER() OVER (ORDER BY time) AS percentile, time
FROM quantiles CROSS JOIN UNNEST(quantiles.timearray) AS time
ORDER BY percentile
In [12]:
%chart columns --data data --fields percentile,time
Out[12]:
There are other commands, such as those that import (load) and export (extract) data or that handle tables and datasets.
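For example, a hedged sketch of exporting a table to Google Cloud Storage and loading it back might look like the cells below. The bucket path and destination table are hypothetical, and the exact flag names are assumptions; confirm them with %bq extract --help and %bq load --help.
In [ ]:
# Hypothetical example: export the sample table to a GCS bucket you own (path is illustrative).
%bq extract --source cloud-datalab-samples.httplogs.logs_20140615 --destination gs://my-bucket/logs_20140615.csv --format csv
In [ ]:
# Hypothetical example: load that CSV back into a table in your own project.
%bq load --source gs://my-bucket/logs_20140615.csv --destination mydataset.logs_copy --format csv --mode create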
Datalab also allows queries to be constructed one step at a time into composite SQL queries that use constructs such as user-defined functions and external data sources, in order to harness the full power of BigQuery SQL while managing authoring complexity.
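As a sketch of that composition (the --subqueries argument shown here is an assumption; check %%bq query --help for the exact option), a named query can be reused as a building block of a larger query:
In [ ]:
%%bq query --name slow_requests
-- Hypothetical building block: requests above an arbitrary latency threshold.
SELECT timestamp, latency
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE latency > 1000
In [ ]:
%%bq query --subqueries slow_requests
-- Compose on top of the named subquery defined above.
SELECT COUNT(*) AS slow_request_count, AVG(latency) AS avg_slow_latency
FROM slow_requests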
All of these BigQuery commands are implemented on top of the Python BigQuery APIs in the google.datalab.bigquery Python module. This implementation not only allows you to write arbitrary code and logic while working with BigQuery data, but also lets you combine SQL with Python and its data analysis libraries, such as pandas and matplotlib, to perform sophisticated, custom data analysis and visualization tasks.
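As a minimal sketch (the exact call chain is an assumption about the google.datalab.bigquery API; the API-focused notebooks are the authoritative reference), running a query from Python and pulling its results into a pandas DataFrame might look like this:
In [ ]:
# A minimal sketch, assuming the Query/execute/result/to_dataframe call chain
# of google.datalab.bigquery; see the API tutorial notebooks for details.
import google.datalab.bigquery as bq

df = bq.Query(
    'SELECT timestamp, latency '
    'FROM `cloud-datalab-samples.httplogs.logs_20140615` '
    'ORDER BY latency DESC LIMIT 10'
).execute().result().to_dataframe()

df.head()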
These topics are covered in other BigQuery tutorial notebooks that are included with Datalab.